Northern Harbour District
Radiometer Calibration using Machine Learning
Leeney, S. A. K., Bevins, H. T. J., Acedo, E. de Lera, Handley, W. J., Kirkham, C., Patel, R. S., Zhu, J., Molnar, D., Cumner, J., Anstey, D., Artuc, K., Bernardi, G., Bucher, M., Carey, S., Cavillot, J., Chiello, R., Croukamp, W., de Villiers, D. I. L., Ely, J. A., Fialkov, A., Gessey-Jones, T., Kulkarni, G., Magro, A., Meerburg, P. D., Mittal, S., Pattison, J. H. N., Pegwal, S., Pieterse, C. M., Pritchard, J. R., Puchwein, E., Razavi-Ghods, N., Roque, I. L. V., Saxena, A., Scheutwinkel, K. H., Scott, P., Shen, E., Sims, P. H., Spinelli, M.
Radiometers are crucial instruments in radio astronomy, forming the primary component of nearly all radio telescopes. They measure the intensity of electromagnetic radiation, converting this radiation into electrical signals. A radiometer's primary components are an antenna and a Low Noise Amplifier (LNA), which is the core of the "receiver" chain. Instrumental effects introduced by the receiver are typically corrected or removed during calibration. However, impedance mismatches between the antenna and receiver can introduce unwanted signal reflections and distortions. Traditional calibration methods, such as Dicke switching, alternate the receiver input between the antenna and a well-characterised reference source to mitigate errors by comparison. Recent advances in Machine Learning (ML) offer promising alternatives. Neural networks, which are trained using known signal sources, provide a powerful means to model and calibrate complex systems where traditional analytical approaches struggle. These methods are especially relevant for detecting the faint sky-averaged 21-cm signal from atomic hydrogen at high redshifts. This is one of the main challenges in observational cosmology today. Here, for the first time, we introduce and test a machine learning-based calibration framework capable of achieving the precision required for radiometric experiments aiming to detect the 21-cm line.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- Asia > China > Beijing > Beijing (0.04)
- North America > United States > Arizona (0.04)
- (10 more...)
An Investigation of Robustness of LLMs in Mathematical Reasoning: Benchmarking with Mathematically-Equivalent Transformation of Advanced Mathematical Problems
Hao, Yuren, Wan, Xiang, Zhai, ChengXiang
In this paper, we introduce a systematic framework that goes beyond conventional methods to assess LLMs' mathematical-reasoning robustness by stress-testing them on advanced math problems that are mathematically equivalent but vary linguistically and parametrically. These transformations allow us to measure the sensitivity of LLMs to non-mathematical perturbations, thereby enabling a more accurate evaluation of their mathematical reasoning capabilities. Using this new evaluation methodology, we created PutnamGAP, a new benchmark dataset with multiple mathematically-equivalent variations of competition-level math problems. With the new dataset, we evaluate multiple families of representative LLMs and examine their robustness. Across 18 commercial and open-source models we observe sharp performance degradation on the variants. OpenAI's flagship reasoning model, O3, scores 51.5% on the originals but drops by 4.7 percentage points on surface-renaming variants, and by 12.9 percentage points on parametric variants, while smaller models fare far worse. Overall, the results show that the proposed new evaluation methodology is effective for deepening our understanding of the robustness of LLMs and generating new insights for further improving their mathematical reasoning capabilities.
- North America > United States > Illinois > Champaign County > Urbana (0.14)
- Europe > Austria > Vienna (0.14)
- North America > United States > District of Columbia > Washington (0.04)
- (6 more...)
Challenging the Abilities of Large Language Models in Italian: a Community Initiative
Nissim, Malvina, Croce, Danilo, Patti, Viviana, Basile, Pierpaolo, Attanasio, Giuseppe, Musacchio, Elio, Rinaldi, Matteo, Borazio, Federico, Francis, Maria, Gili, Jacopo, Scalena, Daniel, Altuna, Begoña, Azurmendi, Ekhi, Basile, Valerio, Bentivogli, Luisa, Bisazza, Arianna, Bolognesi, Marianna, Brunato, Dominique, Caselli, Tommaso, Casola, Silvia, Cassese, Maria, Cettolo, Mauro, Collacciani, Claudia, De Cosmo, Leonardo, Di Buono, Maria Pia, Esuli, Andrea, Etxaniz, Julen, Ferrando, Chiara, Fidelangeli, Alessia, Frenda, Simona, Fusco, Achille, Gaido, Marco, Galassi, Andrea, Galli, Federico, Giordano, Luca, Goffetti, Mattia, Gonzalez-Dios, Itziar, Gregori, Lorenzo, Grundler, Giulia, Iannaccone, Sandro, Jiang, Chunyang, La Quatra, Moreno, Lagioia, Francesca, Lo, Soda Marem, Madeddu, Marco, Magnini, Bernardo, Manna, Raffaele, Mercorio, Fabio, Merlo, Paola, Muti, Arianna, Nastase, Vivi, Negri, Matteo, Onorati, Dario, Palmieri, Elena, Papi, Sara, Passaro, Lucia, Pensa, Giulia, Piergentili, Andrea, Potertì, Daniele, Puccetti, Giovanni, Ranaldi, Federico, Ranaldi, Leonardo, Ravelli, Andrea Amelio, Rosola, Martina, Ruzzetti, Elena Sofia, Samo, Giuseppe, Santilli, Andrea, Santin, Piera, Sarti, Gabriele, Sartor, Giovanni, Savoldi, Beatrice, Serino, Antonio, Seveso, Andrea, Siciliani, Lucia, Torroni, Paolo, Varvara, Rossella, Zaninello, Andrea, Zanollo, Asya, Zanzotto, Fabio Massimo, Zeinalipour, Kamyar, Zugarini, Andrea
The rapid progress of Large Language Models (LLMs) has transformed natural language processing and broadened its impact across research and society. Yet, systematic evaluation of these models, especially for languages beyond English, remains limited. "Challenging the Abilities of LAnguage Models in ITAlian" (CALAMITA) is a large-scale collaborative benchmarking initiative for Italian, coordinated under the Italian Association for Computational Linguistics. Unlike existing efforts that focus on leaderboards, CALAMITA foregrounds methodology: it federates more than 80 contributors from academia, industry, and the public sector to design, document, and evaluate a diverse collection of tasks, covering linguistic competence, commonsense reasoning, factual consistency, fairness, summarization, translation, and code generation. Through this process, we not only assembled a benchmark of over 20 tasks and almost 100 subtasks, but also established a centralized evaluation pipeline that supports heterogeneous datasets and metrics. We report results for four open-weight LLMs, highlighting systematic strengths and weaknesses across abilities, as well as challenges in task-specific evaluation. Beyond quantitative results, CALAMITA exposes methodological lessons: the necessity of fine-grained, task-representative metrics, the importance of harmonized pipelines, and the benefits and limitations of broad community engagement. CALAMITA is conceived as a rolling benchmark, enabling continuous integration of new tasks and models. This makes it both a resource -- the most comprehensive and diverse benchmark for Italian to date -- and a framework for sustainable, community-driven evaluation. We argue that this combination offers a blueprint for other languages and communities seeking inclusive and rigorous LLM evaluation practices.
- North America > United States > Montana (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- (36 more...)
- Research Report > New Finding (1.00)
- Overview (1.00)
- Law (1.00)
- Health & Medicine (1.00)
- Information Technology (0.92)
- (3 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Algorithmic Thinking Theory
Bateni, MohammadHossein, Cohen-Addad, Vincent, Gu, Yuzhou, Lattanzi, Silvio, Meierhans, Simon, Mohri, Christopher
Initial challenges, such as grade-school mathematics (GSM8K) and standard competition math (MATH dataset), have largely been surmounted, pushing the frontier of AI reasoning toward "grand challenge" problems, such as those found in the International Mathematical Olympiad (IMO). These problems, renowned for their demand for deep insight, creativity, and rigorous proof, expose a fascinating weakness in modern LLMs. While a model's performance on a single attempt (termed pass@1) may be very low, its ability to produce a correct answer within k attempts (pass@k) can be significantly higher. This pass@1 versus pass@k gap, especially pronounced when sampling with high temperature to produce diverse outputs, suggests that models possess a vast, latent capability that is not accessible in a single, high-confidence generation. Interestingly, to recover the full power of the model it is not sufficient to simply use multiple attempts. In fact, even the pass@k metric fails to capture the full story. On the most difficult problems, simply sampling k times and selecting the best answer (e.g., "best-of-32") still yields poor results. For instance, Huang and Yang (2025) report that a best-of-32 baseline on the IMO 2025 problems achieved an accuracy of only 31.6-38.1% for leading models [HY25]. This paradox lies at the heart of our work: the latent capability of LLMs is not merely a matter of selection (finding one correct needle in a haystack of k attempts), but one of synthesis.
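The pass@1 versus pass@k gap described above can be made concrete with a small calculation. The sketch below is an illustration rather than the authors' evaluation code: it assumes the combinatorial estimator commonly used for pass@k, in which n attempts are sampled per problem, c of them are correct, and the probability that at least one of k attempts drawn without replacement is correct equals 1 - C(n-c, k) / C(n, k). The function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled attempts, c of which are correct.

    Returns the probability that a random size-k subset of the n attempts
    contains at least one correct attempt.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect attempts: every size-k subset holds a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Illustrative numbers only: 32 attempts, 4 of them correct.
    for k in (1, 8, 32):
        print(k, pass_at_k(32, 4, k))
```

With 4 correct attempts out of 32, the single-attempt estimate is 0.125 while the 32-attempt estimate is 1.0, which is the kind of gap the abstract refers to. Note that pass@k only measures whether a correct attempt exists; it says nothing about whether a selection rule such as best-of-k can identify it.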
- Europe > Austria > Vienna (0.14)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- North America > United States > New York (0.04)
- (5 more...)
Generative AI for Self-Adaptive Systems: State of the Art and Research Roadmap
Li, Jialong, Zhang, Mingyue, Li, Nianyu, Weyns, Danny, Jin, Zhi, Tei, Kenji
Self-adaptive systems (SASs) are designed to handle changes and uncertainties through a feedback loop with four core functionalities: monitoring, analyzing, planning, and execution. Recently, generative artificial intelligence (GenAI), especially the area of large language models, has shown impressive performance in data comprehension and logical reasoning. These capabilities are highly aligned with the functionalities required in SASs, suggesting a strong potential to employ GenAI to enhance SASs. However, the specific benefits and challenges of employing GenAI in SASs remain unclear. Providing a comprehensive understanding of these benefits and challenges is complex for several reasons: limited publications in the SAS field, the technological and application diversity within SASs, and the rapid evolution of GenAI technologies. To that end, this paper aims to provide researchers and practitioners with a comprehensive snapshot that outlines the potential benefits and challenges of employing GenAI within SASs. Specifically, we gather, filter, and analyze literature from four distinct research fields and organize them into two main categories of potential benefits: (i) enhancements to the autonomy of SASs centered around the specific functions of the MAPE-K feedback loop, and (ii) improvements in the interaction between humans and SASs within human-on-the-loop settings. From our study, we outline a research roadmap that highlights the challenges of integrating GenAI into SASs. The roadmap starts with outlining key research challenges that need to be tackled to exploit the potential for applying GenAI in the field of SAS. The roadmap concludes with a practical reflection, elaborating on current shortcomings of GenAI and proposing possible mitigation strategies.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- Europe > Portugal > Lisbon > Lisbon (0.04)
- (37 more...)
- Research Report (1.00)
- Overview (1.00)
- Health & Medicine (1.00)
- Government > Military (0.92)
- Transportation > Ground > Road (0.67)
- (3 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (1.00)
- (7 more...)
Investigating Bias: A Multilingual Pipeline for Generating, Solving, and Evaluating Math Problems with LLMs
Mahran, Mariam, Simbeck, Katharina
Large Language Models (LLMs) are increasingly used for educational support, yet their response quality varies depending on the language of interaction. This paper presents an automated multilingual pipeline for generating, solving, and evaluating math problems aligned with the German K-10 curriculum. We generated 628 math exercises and translated them into English, German, and Arabic. Three commercial LLMs (GPT-4o-mini, Gemini 2.5 Flash, and Qwen-plus) were prompted to produce step-by-step solutions in each language. A held-out panel of LLM judges, including Claude 3.5 Haiku, evaluated solution quality using a comparative framework. Results show a consistent gap, with English solutions rated highest and Arabic solutions often ranked lower. These findings highlight persistent linguistic bias and the need for more equitable multilingual AI systems in education.
- South America > Uruguay > Maldonado > Maldonado (0.04)
- North America > United States > Florida > Hillsborough County > University (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- (4 more...)
MemOS: A Memory OS for AI System
Li, Zhiyu, Xi, Chenyang, Li, Chunyu, Chen, Ding, Chen, Boyu, Song, Shichao, Niu, Simin, Wang, Hanyu, Yang, Jiawei, Tang, Chen, Yu, Qingchen, Zhao, Jihao, Wang, Yezhaohui, Liu, Peng, Lin, Zehao, Wang, Pengyuan, Huo, Jiahao, Chen, Tianyi, Chen, Kai, Li, Kehang, Tao, Zhen, Lai, Huayi, Wu, Hao, Tang, Bo, Wang, Zhengren, Fan, Zhaoxin, Zhang, Ningyu, Zhang, Linfeng, Yan, Junchi, Yang, Mingchuan, Xu, Tong, Xu, Wei, Chen, Huajun, Wang, Haofen, Yang, Hongkang, Zhang, Wentao, Xu, Zhi-Qin John, Chen, Siheng, Xiong, Feiyu
Large Language Models (LLMs) have become an essential infrastructure for Artificial General Intelligence (AGI), yet their lack of well-defined memory management systems hinders the development of long-context reasoning, continual personalization, and knowledge consistency. Existing models mainly rely on static parameters and short-lived contextual states, limiting their ability to track user preferences or update knowledge over extended periods. While Retrieval-Augmented Generation (RAG) introduces external knowledge in plain text, it remains a stateless workaround without lifecycle control or integration with persistent representations. Recent work has modeled the training and inference cost of LLMs from a memory hierarchy perspective, showing that introducing an explicit memory layer between parameter memory and external retrieval can substantially reduce these costs by externalizing specific knowledge. Beyond computational efficiency, LLMs face broader challenges arising from how information is distributed over time and context, requiring systems capable of managing heterogeneous knowledge spanning different temporal scales and sources. To address this challenge, we propose MemOS, a memory operating system that treats memory as a manageable system resource. It unifies the representation, scheduling, and evolution of plaintext, activation-based, and parameter-level memories, enabling cost-efficient storage and retrieval. As the basic unit, a MemCube encapsulates both memory content and metadata such as provenance and versioning. MemCubes can be composed, migrated, and fused over time, enabling flexible transitions between memory types and bridging retrieval with parameter-based learning. MemOS establishes a memory-centric system framework that brings controllability, plasticity, and evolvability to LLMs, laying the foundation for continual learning and personalized modeling.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
- (11 more...)
- Health & Medicine (1.00)
- Information Technology > Security & Privacy (0.93)
- Law (0.93)
- Government (0.67)
Focusing on Language: Revealing and Exploiting Language Attention Heads in Multilingual Large Language Models
Liu, Xin, Song, Qiyang, Zhou, Qihang, Du, Haichao, Xu, Shaowen, Jiang, Wenbo, Zhang, Weijuan, Jia, Xiaoqi
Meanwhile, efforts to interpret their internal mechanisms have emerged, offering insights to enhance multilingual performance. While multi-head self-attention (MHA) has proven critical in many areas, its role in multilingual capabilities remains underexplored. In this work, we study the contribution of MHA in supporting multilingual processing in LLMs. We propose Language Attention Head Importance Scores (LAHIS), an effective and efficient method that identifies attention head importance for multilingual capabilities via a single forward and backward pass through the LLM. Applying LAHIS to Aya-23-8B, Llama-3.2-3B, and Mistral-7B-v0.1, we reveal the existence of both language-specific and language-general heads. Language-specific heads enable cross-lingual attention transfer to guide the model toward target language contexts and mitigate the off-target language generation issue, contributing to addressing challenges in multilingual LLMs. We also introduce a lightweight adaptation that learns a soft head mask to modulate attention outputs over language heads, requiring only 20 tunable parameters to improve XQuAD accuracy. Overall, our work enhances both the interpretability and multilingual capabilities of LLMs from the perspective of MHA.
- Leisure & Entertainment > Sports > Football (0.46)
- Information Technology (0.46)
Teaching Old Tokenizers New Words: Efficient Tokenizer Adaptation for Pre-trained Models
Purason, Taido, Chizhov, Pavel, Yamshchikov, Ivan P., Fishel, Mark
Tokenizer adaptation plays an important role in transferring pre-trained language models to new domains or languages. In this work, we address two complementary aspects of this process: vocabulary extension and pruning. The common approach to extension trains a new tokenizer on domain-specific text and appends the tokens that do not overlap with the existing vocabulary, which often results in many tokens that are unreachable or never used. We propose continued BPE training, which adapts a pre-trained tokenizer by continuing the BPE merge learning process on new data. Experiments across multiple languages and model families show that this approach improves tokenization efficiency and leads to better utilization of added vocabulary. We also introduce leaf-based vocabulary pruning, which removes redundant tokens while preserving model quality. Together, these methods provide practical tools for controlled vocabulary modification, which we release as an open-source package.
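The idea of continuing BPE merge learning on new data can be illustrated with a toy, word-level sketch. This is not the authors' released package and it omits everything a real tokenizer needs (pre-tokenization, byte fallback, special tokens); the function names `apply_merge` and `continue_bpe` are hypothetical. It assumes the existing tokenizer is represented only by its ordered merge list: those merges are first replayed on the new corpus, and further merges are then learned from the resulting symbol sequences, so every new token is built from symbols the existing vocabulary can already produce.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace each adjacent occurrence of `pair` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def continue_bpe(existing_merges, corpus_words, num_new_merges):
    """Learn up to `num_new_merges` extra merges on top of existing ones."""
    word_freq = Counter(corpus_words)
    # Start from characters, then replay the pre-trained merges in order.
    segmented = {word: list(word) for word in word_freq}
    for pair in existing_merges:
        for word in segmented:
            segmented[word] = apply_merge(segmented[word], pair)

    new_merges = []
    for _ in range(num_new_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for word, freq in word_freq.items():
            symbols = segmented[word]
            for left, right in zip(symbols, symbols[1:]):
                pair_counts[(left, right)] += freq
        if not pair_counts:
            break
        # Most frequent pair; ties broken by lexicographic order.
        best = max(sorted(pair_counts), key=pair_counts.get)
        new_merges.append(best)
        for word in segmented:
            segmented[word] = apply_merge(segmented[word], best)
    return new_merges


if __name__ == "__main__":
    corpus = ["low", "low", "lower", "lowest"]
    print(continue_bpe([("l", "o")], corpus, 2))
```

Because each learned merge combines symbols that already occur in the segmented new-domain text, every added token is reachable by construction, in contrast to appending tokens from an independently trained tokenizer.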
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Europe > Austria > Vienna (0.14)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- (15 more...)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.49)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.30)
ELSPR: Evaluator LLM Training Data Self-Purification on Non-Transitive Preferences via Tournament Graph Reconstruction
Yu, Yan, Liu, Yilun, He, Minggui, Tao, Shimin, Meng, Weibin, Yang, Xinhua, Zhang, Li, Ma, Hongxia, Li, Dengye, Wei, Daimeng, Chen, Boxing, Li, Fuliang
Pairwise evaluation of large language models (LLMs) has become the dominant paradigm for benchmarking open-ended tasks, yet non-transitive preferences, where evaluators prefer A over B, B over C, but C over A, fundamentally undermine ranking reliability. We show that this critical issue stems largely from low-quality data that contains inherently ambiguous preference pairs. To address this challenge, we propose ELSPR, a principled graph-theoretic framework that models pairwise preferences as tournament graphs and systematically identifies problematic training data. ELSPR quantifies non-transitivity through strongly connected components (SCCs) analysis and measures overall preference clarity using a novel normalized directed graph structural entropy metric. Our filtering methodology selectively removes preference data that induce non-transitivity while preserving transitive preferences. Extensive experiments on the AlpacaEval benchmark demonstrate that models fine-tuned on ELSPR-filtered data achieve substantial improvements: a 13.8% reduction in non-transitivity, a 0.088 decrease in structural entropy, and significantly enhanced discriminative power in real-world evaluation systems. Human validation confirms that discarded data exhibit dramatically lower inter-annotator agreement (34.4% vs. 52.6%) and model-human consistency (51.2% vs. 80.6%) compared to cleaned data. These findings establish ELSPR as an effective data self-purification approach for developing more robust, consistent, and human-aligned LLM evaluation systems.
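The link between non-transitive preferences and strongly connected components can be shown on a small example. The sketch below is a generic illustration, not the ELSPR implementation: it assumes preferences are given as (winner, loser) pairs, builds a directed graph from them, and uses Kosaraju's two-pass algorithm to find SCCs. Any component with more than one node contains a preference cycle such as A over B, B over C, C over A. The function names are hypothetical, and the structural entropy metric from the abstract is not reproduced here.

```python
from collections import defaultdict


def strongly_connected_components(edges):
    """Return the SCCs of a directed graph given as (winner, loser) pairs."""
    graph, reverse, nodes = defaultdict(list), defaultdict(list), set()
    for winner, loser in edges:
        graph[winner].append(loser)
        reverse[loser].append(winner)
        nodes.update((winner, loser))

    # First pass: iterative DFS, recording nodes in order of completion.
    visited, order = set(), []
    for start in sorted(nodes):
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                # All neighbours explored: the node is finished.
                order.append(node)
                stack.pop()

    # Second pass: sweep the reversed graph in decreasing finish time.
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = [], [start]
        while stack:
            node = stack.pop()
            component.append(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(sorted(component))
    return components


def non_transitive_groups(edges):
    """Keep only components with more than one node (preference cycles)."""
    return [c for c in strongly_connected_components(edges) if len(c) > 1]


if __name__ == "__main__":
    prefs = [("A", "B"), ("B", "C"), ("C", "A"),
             ("A", "D"), ("B", "D"), ("C", "D")]
    print(non_transitive_groups(prefs))
```

In this example A, B, and C form a cycle while D loses to all three, so only the first three items fall into a multi-node component. A fully transitive set of preferences yields no multi-node components at all.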